Analysis of Crime within the District of Columbia

Introduction

This data set contains information on 153,687 felony arrests made in the District of Columbia between 2013 and 2017. The data are readily available on the District of Columbia’s Police Department website. According to the documentation attached to this dataset, a felony arrest is defined as the police taking into custody a person who is suspected of having committed a crime. Although an arrest occurred, information about whether the arrest led to a conviction was not included. As a result, the modeling in this report assumes a 100% conviction rate for all arrests. This is unlikely to hold in practice, so some deviation from the model can be expected.

The crimes recorded in this dataset include boating violations, disorderly conduct, arson, homicide, and many more. These crimes range in severity from misdemeanors to felonies. Of the arrests, 34,133 were of females, 119,373 were of males, and 181 were of people whose sex was not recorded. Because there are far more males in this data set, the model built here may be more applicable to the male population. The mean age of those arrested was 34.8 years.

For this report, we have sought to answer the question: “Is it possible to accurately profile who is likely to commit a crime in DC?” In the last few years, the use of racial profiling by law enforcement has been a controversial topic. Historically, police officers have had a large presence within primarily African American and Latino communities under the assumption that these areas are plagued with crime. As a result, there have been many incidents of unlawful arrest and police brutality.

Structure of the Data Set

This dataset originally contained 28 variables. ObjectID was removed because it corresponded only to the row number. Similarly, CCN and Arrest_Number were removed because these values corresponded to specifics about the arrest that were encrypted due to privacy concerns. The variables remaining after cleaning are shown below.

## 'data.frame':    153687 obs. of  25 variables:
##  $ OBJECTID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ YEAR             : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ MONTH            : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ DAY              : int  5 9 9 28 7 9 23 27 8 14 ...
##  $ HOUR             : int  11 15 17 10 7 4 18 6 15 9 ...
##  $ AGE              : int  34 23 44 23 22 50 30 23 27 27 ...
##  $ DEFENDANT_PSA    : int  0 403 0 505 103 0 602 505 603 607 ...
##  $ DEFENANT_ISTRICT : int  0 4 0 5 1 0 6 5 6 6 ...
##  $ Race             : Factor w/ 4 levels "1","2","3","4": 1 2 3 3 3 3 3 3 3 3 ...
##  $ ETHNICITY        : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 NA 1 1 1 ...
##  $ SEX              : Factor w/ 3 levels "1","2","3": 1 2 2 2 2 2 2 2 3 2 ...
##  $ CATEGORY         : Factor w/ 29 levels "1","2","3","4",..: 2 6 22 11 22 1 1 1 6 2 ...
##  $ DESCRIPTION      : Factor w/ 997 levels "1","2","3","4",..: 664 816 737 910 737 883 904 904 751 429 ...
##  $ ARREST_PSA       : int  702 403 703 504 103 604 601 503 502 405 ...
##  $ ARREST_DISTRICT  : int  7 4 7 5 1 6 6 5 5 4 ...
##  $ ARREST_BLOCKX    : int  402600 398200 400700 400600 399000 406000 404200 402300 400000 400700 ...
##  $ ARREST_BLOCKY    : int  131700 143300 132800 139500 137300 135100 137500 138600 139300 142400 ...
##  $ OFFENSE_BLOCKY   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ OFFENSE_BLOCKX   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ OFFENSE_PSA      : int  NA NA 706 NA 101 NA NA NA NA NA ...
##  $ OFFENSE_DISTRICT : int  NA NA 7 NA 1 NA NA NA NA NA ...
##  $ ARREST_LATITUDE  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ARREST_LONGITUDE : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ OFFENSE_LATITUDE : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ OFFENSE_LONGITUDE: num  NA NA NA NA NA NA NA NA NA NA ...

The number of missing values in each variable is listed below.

##          OBJECTID              YEAR             MONTH               DAY 
##                 0                 0                 0                 0 
##              HOUR               AGE     DEFENDANT_PSA  DEFENANT_ISTRICT 
##                 0                 0                 0                 0 
##              Race         ETHNICITY               SEX          CATEGORY 
##                 0             45899                 0                 0 
##       DESCRIPTION        ARREST_PSA   ARREST_DISTRICT     ARREST_BLOCKX 
##                 0               813               813              1773 
##     ARREST_BLOCKY    OFFENSE_BLOCKY    OFFENSE_BLOCKX       OFFENSE_PSA 
##              1773               746               746               535 
##  OFFENSE_DISTRICT   ARREST_LATITUDE  ARREST_LONGITUDE  OFFENSE_LATITUDE 
##               527              2274              2274               746 
## OFFENSE_LONGITUDE 
##               746

Of the 25 variables in the cleaned data set, 12 had no missing values. Another 12 variables had between 527 and 2,274 missing values each. Considering the total number of observations in this file (153,687), these counts are small. Only one variable (Ethnicity) had a large number of missing values (45,899) and, for this reason, it was not used in the statistical studies.
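The per-variable counts reported above can be reproduced with a few lines of base R. The following is a minimal sketch: the data frame `arrests` and its values are invented for illustration, and only the column names are taken from the output above.

```r
# Toy stand-in for the arrest data; the values are invented for illustration.
arrests <- data.frame(
  AGE         = c(34, 23, 44, 23, 22),
  ETHNICITY   = factor(c("1", "1", NA, "1", NA)),
  OFFENSE_PSA = c(NA, NA, 706, NA, 101)
)

# Count the missing values in each column.
missing_counts <- colSums(is.na(arrests))
print(missing_counts)

# Share of rows missing per column, to judge which variables are usable.
missing_share <- missing_counts / nrow(arrests)

# Names of the columns that have no missing values at all.
complete_vars <- names(missing_counts)[missing_counts == 0]
```

Comparing `missing_share` against a chosen threshold is one way to decide which variables, such as Ethnicity here, to leave out of a model.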

Correlation of Sex and Age with Type of Crime

## 
## Call:
## glm(formula = SEX ~ CATEGORY, family = "binomial", data = sex_category)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.964   0.041   0.047   0.050   0.292  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    6.9580     0.2501   27.82  < 2e-16 ***
## CATEGORY2      0.2037     0.3485    0.58  0.55885    
## CATEGORY3     -0.8807     0.4792   -1.84  0.06607 .  
## CATEGORY4      0.8976     1.0310    0.87  0.38398    
## CATEGORY5     -0.2560     0.3257   -0.79  0.43187    
## CATEGORY6     -0.2293     0.3919   -0.59  0.55839    
## CATEGORY7      0.1863     0.5126    0.36  0.71629    
## CATEGORY8     -0.3466     0.5594   -0.62  0.53550    
## CATEGORY9     -0.3305     0.5127   -0.64  0.51919    
## CATEGORY10    -0.1405     0.3048   -0.46  0.64488    
## CATEGORY11    -0.1645     0.4332   -0.38  0.70416    
## CATEGORY12    -0.7519     0.3822   -1.97  0.04914 *  
## CATEGORY13    -0.8496     0.4334   -1.96  0.04997 *  
## CATEGORY14    -0.5889     0.7506   -0.78  0.43269    
## CATEGORY15    -0.7485     0.4535   -1.65  0.09887 .  
## CATEGORY16     0.2839     0.5592    0.51  0.61165    
## CATEGORY17    -1.2500     0.5598   -2.23  0.02555 *  
## CATEGORY18    -1.2475     0.5598   -2.23  0.02585 *  
## CATEGORY19    -0.4069     1.0315   -0.39  0.69324    
## CATEGORY20    -0.9035     1.0319   -0.88  0.38127    
## CATEGORY21    12.6081  1007.2056    0.01  0.99001    
## CATEGORY22     0.0304     1.0313    0.03  0.97645    
## CATEGORY23     0.1161     1.0312    0.11  0.91033    
## CATEGORY24    -0.0887     0.5593   -0.16  0.87399    
## CATEGORY25    12.6081   512.0959    0.02  0.98036    
## CATEGORY26    12.6081   533.0567    0.02  0.98113    
## CATEGORY27    -1.6850     1.0333   -1.63  0.10296    
## CATEGORY28    -0.4567     1.0315   -0.44  0.65797    
## CATEGORY29    -3.8225     1.0517   -3.63  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2803.2  on 153686  degrees of freedom
## Residual deviance: 2769.1  on 153658  degrees of freedom
## AIC: 2827
## 
## Number of Fisher Scoring iterations: 18

The results of the logit analysis show that only 5 of the 28 non-reference categories are statistically significant at the 0.05 level. These categories are liquor law violations, damage to property, burglary, offenses against family & children, and arson. Of these five, category 29 has the lowest p-value, suggesting a strong association between the sex of the offender and the probability of committing arson.
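One point worth keeping in mind when reading this kind of output: when the response passed to a binomial glm() is a factor, R treats the first level as "failure" and all other levels as "success", so a three-level SEX factor is implicitly collapsed into two groups. The sketch below shows one way to fit the same kind of model with an explicitly binary response. The data are simulated, and the level labels ("F", "M", "U") and category names are assumptions made for illustration, not the coding used in the arrest data.

```r
set.seed(1)

# Simulated stand-in data; level labels and category names are assumptions.
n <- 2000
category <- factor(sample(c("Assault", "Theft", "Arson"), n, replace = TRUE))
# Make arrests in the "Arson" category less likely to be male than the others.
p_male <- ifelse(category == "Arson", 0.5, 0.8)
sex <- ifelse(runif(n) < p_male, "M", "F")
sex[sample(n, 20)] <- "U"  # a few records with unknown sex
toy <- data.frame(SEX = factor(sex), CATEGORY = category)

# Drop the unknown level so the response is truly binary (F = failure, M = success).
toy_known <- droplevels(subset(toy, SEX != "U"))
toy_known$CATEGORY <- relevel(toy_known$CATEGORY, ref = "Assault")

fit <- glm(SEX ~ CATEGORY, family = binomial, data = toy_known)
print(summary(fit)$coefficients)

# Exponentiated coefficients are odds ratios relative to the reference category.
odds_ratios <- exp(coef(fit))
```

In this simulated example, a negative coefficient for a category means that arrests in that category are less likely to be of males than arrests in the reference category.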

The histogram and density lines indicate that the age at which crimes are committed in DC follows a bimodal distribution. The largest number of crimes is committed by 18-year-olds. The number of crimes decreases steeply for ages between 25 and 42. However, this trend is not sustained: the number of crimes committed by people around 50 years old increases again and then decreases rapidly.

The box-and-whisker plot is useful for understanding the distribution of the data and for visualizing outliers. Here we can see that all but one category (vending violations) are right-skewed. In a right-skewed (positively skewed) distribution, the mean and median are shifted to higher values relative to the mode. In the plot above, the right skew is most likely caused by the outliers.
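The link between outliers and right skew can be checked numerically by comparing the mean and median within each category. This is a small sketch with invented ages for two hypothetical groups, not values from the arrest data.

```r
# Invented ages for two hypothetical categories, one with a single older outlier.
ages <- data.frame(
  AGE      = c(19, 21, 22, 24, 25, 27, 68, 45, 48, 50, 52, 55),
  CATEGORY = rep(c("Robbery", "Vending Violations"), times = c(7, 5))
)

# Mean and median age per category; a mean above the median suggests right skew.
age_mean   <- tapply(ages$AGE, ages$CATEGORY, mean)
age_median <- tapply(ages$AGE, ages$CATEGORY, median)
print(cbind(mean = age_mean, median = age_median))

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges as outliers.
robbery_out <- boxplot.stats(ages$AGE[ages$CATEGORY == "Robbery"])$out
```

In the toy "Robbery" group the single older age pulls the mean above the median, which is the same pattern that the whiskers and outlier points show graphically.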

Several observations can be made from this box-and-whisker plot. For instance, robbery, weapon violations, homicide, and gambling were mostly committed by younger people (late 20’s to early 30’s). The opposite is seen for vending violations, which were mostly committed by people in their 40’s and 50’s.

Violent Crimes Committed in Foggy Bottom/Greater DC

The map above shows the crimes committed between 2013 and 2017. It is zoomed in on the Foggy Bottom, Dupont Circle, Georgetown, and downtown areas; the red, green, and blue circles correspond to robbery, sex abuse, and homicide, respectively. Foggy Bottom and downtown are well known for being relatively safe. This map shows where the crimes were committed in the investigated area. Robbery is fairly common in different areas of DC, including the GWU premises. However, only a few sex abuse and homicide cases were reported in that area.

Sadly, the zoomed-out map shows a different trend in DC’s other neighborhoods. Sex abuse and homicide are far more frequent there than in the aforementioned areas. There is an extreme rise in the number of homicides and sex abuse cases in and around Anacostia, Takoma Park, and Brentwood. However, sex abuse cases are notably more numerous than homicides in the Southwest region.

Do people commit crime in their own neighborhood?

To further understand our SMART question, we wanted to examine whether people are more likely to commit a felony offense in their own neighborhood. Based on our findings, 37,533 people, or about 24% of those arrested for a crime in DC, committed it in their own neighborhood, while 76% of crimes were committed by residents of DC in a different neighborhood or by out-of-staters. To further understand these findings, we also sought to determine whether the crimes committed closer to home were violent offenses, white-collar crimes, or misdemeanors. For our purposes, we have defined violent crimes as including aggravated assault, assault on a police officer, assault with a dangerous weapon, sex offenses, kidnapping, sex abuse, homicide, weapon violations, and arson.

Interestingly, we found that 24.9% of the arrests made between 2013 and 2017 were of people who committed the crime within their own neighborhood. Of these arrests, 51.3% were for a violent crime. In comparison, 26.5% of the arrests documented here were of people who lived outside DC, while the remaining portion were of DC residents who committed a crime in a different neighborhood. Of the crimes that out-of-staters were arrested for, only 23.6% were violent. This could indicate that people are more likely to commit a violent crime within their own neighborhood. Similarly, the age spread for violent crimes committed by DC residents in their own neighborhoods tended to be larger than for those who committed a crime in a different neighborhood.
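A comparison of this kind can be sketched in base R as follows. Several things here are assumptions made for illustration and are not confirmed by the report: that "own neighborhood" means the defendant's PSA equals the offense PSA, that a DEFENDANT_PSA of 0 marks a defendant living outside DC, and that CATEGORY holds text labels. The records themselves are invented.

```r
# Invented records; coding DEFENDANT_PSA == 0 as "lives outside DC" is an assumption.
toy <- data.frame(
  DEFENDANT_PSA = c(403, 0, 505, 103, 0, 602, 505, 603),
  OFFENSE_PSA   = c(403, 703, 504, 103, 604, 601, 505, 502),
  CATEGORY      = c("Homicide", "Narcotics", "Theft", "Arson",
                    "Theft", "Narcotics", "Sex Abuse", "Gambling")
)
violent <- c("Aggravated Assault", "Assault on a Police Officer",
             "Assault with a Dangerous Weapon", "Sex Offenses", "Kidnapping",
             "Sex Abuse", "Homicide", "Weapon Violations", "Arson")

# Classify each arrest by where the defendant lives relative to the offense.
toy$residence <- ifelse(toy$DEFENDANT_PSA == 0, "outside DC",
                 ifelse(toy$DEFENDANT_PSA == toy$OFFENSE_PSA,
                        "own neighborhood", "other neighborhood"))
toy$is_violent <- toy$CATEGORY %in% violent

# Share of arrests in each residence group, then the violent share within each.
group_share   <- prop.table(table(toy$residence))
violent_share <- tapply(toy$is_violent, toy$residence, mean)
print(round(100 * group_share, 1))
print(round(100 * violent_share, 1))
```

Rows with a missing OFFENSE_PSA would need to be handled before this comparison on the real data, since `==` returns NA when either side is missing.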

To better understand the relationship between age and the type of crime committed, we have created the density plot shown below. Based on this density plot, crimes such as narcotics, arson, sex offenses, and gambling are more likely to be committed by younger people. Felonies involving offenses against family and children, liquor laws, and damage to property occur later in life. If this trend is true, younger people should be profiled more by police as having potentially committed a violent crime.

Building a Multivariate Model

Here, we have sought to determine whether police can accurately predict the type of crime a person is likely to commit based on variables present in this dataset, such as age, race, and sex. First, we selected the features of the model using the Bayesian information criterion (BIC). In analyzing the BIC plot shown below, we looked for the model with the fewest predictors and the lowest BIC score. Based on this, the model that we build here uses hour, age, defendant PSA, race, sex, arrest PSA, arrest district, offense PSA, and offense district.
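The report does not state which function produced the BIC plot. As a minimal illustration of the selection idea only, the sketch below compares a few candidate linear models with base R's BIC() on simulated data. The variable names and the candidate sets are hypothetical.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2 only; x3 is pure noise.
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 * x1 - 1.5 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# Candidate predictor sets, from smallest to largest.
candidates <- list(
  "x1"           = y ~ x1,
  "x1 + x2"      = y ~ x1 + x2,
  "x1 + x2 + x3" = y ~ x1 + x2 + x3
)

# BIC penalizes each extra parameter by log(n), so lower is better.
bic_scores <- sapply(candidates, function(f) BIC(lm(f, data = sim)))
print(round(bic_scores, 1))

best_model <- names(which.min(bic_scores))
```

Because the penalty grows with the number of parameters, a predictor that adds little explanatory power tends to raise the BIC rather than lower it, which is why the criterion favors smaller models.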

To develop a multivariate model, we split the data set into a training set and a test set. The training set contained 67% of the observations and the test set the remaining 33%. Observations were assigned at random, and the variables were centered and scaled.
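A split of this kind can be sketched as follows. The data are simulated, and scaling the test set with the training set's means and standard deviations is a design choice assumed here; the report does not say whether scaling was done before or after the split.

```r
set.seed(123)

# Simulated numeric predictors standing in for the arrest variables.
toy <- data.frame(HOUR = sample(0:23, 300, replace = TRUE),
                  AGE  = sample(18:70, 300, replace = TRUE))

# Randomly assign 67% of the rows to the training set.
train_idx <- sample(nrow(toy), size = round(0.67 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Center and scale using the training set's means and standard deviations,
# then apply the same transformation to the test set.
train_scaled <- scale(train)
test_scaled  <- scale(test,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.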

The multivariate model was produced using the lda() function from the MASS package. Linear discriminant analysis (LDA) seeks a linear combination of features that can be used to separate two or more classes of an event. Essentially, this model attempts to recognize a pattern among the predictor variables in order to predict the crime committed. The coefficients of linear discriminants of this model are displayed below. For each LD, the coefficients are multiplied by the corresponding predictor values and summed to give the discriminant score for that observation, which can then be used to compute the posterior probability of class membership.

##                       LD1     LD2      LD3      LD4      LD5      LD6      LD7
## Hour             9.92e-01  2.0004 6.39e-01 6.77e-01 6.59e-01 9.86e-01 7.85e-01
## Age              1.20e+00  0.7034 5.82e-01 5.90e-01 1.63e+00 9.75e-01 1.34e+00
## Defendant_PSA    1.19e+00  1.2870 1.68e+00 6.65e-01 1.83e+00 1.03e+00 5.79e-01
## Race             1.23e+00  1.4889 1.34e+00 9.90e-01 9.39e-01 1.39e+00 2.31e+00
## Sex              9.88e-01  0.6062 1.36e+00 5.46e-01 5.70e-01 1.12e+00 9.62e-01
## Offense_PSA      1.32e-11  0.0171 1.46e-06 2.95e+02 1.40e+04 1.44e+37 1.65e-08
## Offense_District 2.66e+10 53.7871 6.29e+05 3.27e-03 7.50e-05 5.39e-38 7.95e+07
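To make the scoring step concrete, the following sketch fits lda() from MASS to simulated data with three classes and reproduces the discriminant scores by hand. The data, class labels, and variable names are invented for illustration and are unrelated to the arrest data.

```r
library(MASS)  # provides lda()

set.seed(7)

# Simulated two-predictor data with three classes; not the arrest data.
n   <- 150
grp <- factor(rep(c("A", "B", "C"), each = n))
x1  <- rnorm(3 * n, mean = rep(c(0, 3, 6), each = n))
x2  <- rnorm(3 * n, mean = rep(c(0, 3, 0), each = n))
sim <- data.frame(grp, x1, x2)

fit <- lda(grp ~ x1 + x2, data = sim)
print(fit$scaling)  # coefficients of linear discriminants (LD1, LD2)

# predict() returns the predicted class, posterior probabilities, and LD scores.
pred <- predict(fit, sim)

# Scores by hand: center the predictors on the prior-weighted mean of the
# class means, then multiply by the LD coefficients.
overall_mean  <- colSums(fit$prior * fit$means)
manual_scores <- scale(as.matrix(sim[, c("x1", "x2")]),
                       center = overall_mean, scale = FALSE) %*% fit$scaling

accuracy <- mean(pred$class == sim$grp)
```

Each row of `pred$posterior` sums to 1, and the class with the largest posterior probability is the one reported in `pred$class`.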

Conclusion

Linear discriminant analysis assumes that the data in each class follow a Gaussian distribution and that all classes share the same covariance matrix. This model has been shown to be well suited for multi-class analysis and, like PCA, it can be used as a dimensionality reduction technique. However, when the number of observations for each arrest type is imbalanced in the training set, the model may be unable to accurately classify the observations in the test set. We expect this accuracy problem with our model, because there are far fewer violent crimes than misdemeanors such as narcotics. Also, the LDA model is linear in the predictors. Higher-order interactions that may exist may therefore not be captured accurately by this model.
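The effect of class imbalance on accuracy can be illustrated with a small confusion matrix. The labels and counts below are invented and do not come from the model in this report.

```r
# Invented true and predicted labels for an imbalanced two-class problem.
truth     <- factor(c(rep("Narcotics", 90), rep("Homicide", 10)))
predicted <- factor(c(rep("Narcotics", 90), rep("Narcotics", 8), rep("Homicide", 2)),
                    levels = levels(truth))

# Rows are the true classes, columns the predicted classes.
conf_mat <- table(truth = truth, predicted = predicted)
print(conf_mat)

# Overall accuracy can look good while the rare class is mostly misclassified.
overall_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
per_class_recall <- diag(conf_mat) / rowSums(conf_mat)
```

Reporting the per-class values alongside the overall accuracy makes it easier to see how well the less frequent arrest types are classified.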

In conclusion, this data set contains a wide variety of arrest types. Based on the findings in this report, we do not feel that the DC police are able to predict the profile of a person who is more likely to commit a crime in DC. There are two reasons for this. First, based on the BIC, the model needs to include features such as the arrest district and the person’s own police district. When police profile a potential criminal, however, they often make their arrests based on physical descriptions such as race and approximate age. As a result, our model could not actually be employed by officers on the street. Second, in agreement with the literature, “community policing” often involves some type of bias. To demonstrate the ability of a police officer to predict the crime a person may have committed, the model would have to include some type of bias variable. Based on this current data set, community policing should not be used.

Ashley Frankenfield, Gessica Vasconcelos

4/3/2020